170 research outputs found

OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF

Author: A Klöckner
D Charousset
G Agha
J Nickolls
JD Owens
K Wu
L Dagum
S Srinivasan
S Wienke
T Desell
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

The actor model of computation has been designed for seamless support of concurrency and distribution. However, it remains unspecific about data-parallel program flows, while the available processing power of modern many-core hardware such as graphics processing units (GPUs) or coprocessors increases the relevance of data parallelism for general-purpose computation. In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework (CAF). This offers a high-level interface for accessing any OpenCL device without leaving the actor paradigm. The new type of actor is integrated into the runtime environment of CAF and gives rise to transparent message passing in distributed systems on heterogeneous hardware. Following the actor logic in CAF, OpenCL kernels can be composed while encapsulated in C++ actors, and hence operate in a multi-stage fashion on data resident on the GPU. Developers are thus enabled to build complex data-parallel programs from primitives without leaving the actor paradigm or sacrificing performance. Our evaluations on commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear scaling behavior when offloading larger workloads. For sub-second duties, the efficiency of offloading was found to differ largely between devices. Moreover, our findings indicate a negligible overhead over programming with the native OpenCL API. Comment: 28 pages

arXiv.org e-Print Archive

Crossref

REPOSIT

Optimization of Convolutional Neural Network ensemble classifiers by Genetic Algorithms

Author: A Krizhevsky
A Vishnuvarthanan
FA Spanhol
J Nickolls
MA Molina-Cabello
OK Bagui
PA Nogueira
R Davis
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2019

Breast cancer exhibits a high mortality rate and is the most invasive cancer in women. Analysis of histopathological images could predict this disease, and computational image processing might support this task. In this work, a proposal that employs deep learning convolutional neural networks is presented. An ensemble of networks is then considered in order to enhance the recognition performance of the system through the consensus of the networks in the ensemble. Finally, a genetic algorithm is used to choose the networks that belong to the ensemble. The proposal has been tested by carrying out several experiments with a set of benchmark images. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
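The selection step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the bit-mask encoding, the majority-vote fitness, binary labels, truncation selection, one-point crossover, and every name and parameter (`select_ensemble`, `pop_size`, `p_mut`, and so on) are assumptions made for illustration only.

```python
import random


def majority_vote_accuracy(mask, predictions, labels):
    """Accuracy of the majority vote of the selected networks (binary labels assumed)."""
    chosen = [p for keep, p in zip(mask, predictions) if keep]
    if not chosen:
        return 0.0  # an empty ensemble cannot classify anything
    correct = 0
    for k, label in enumerate(labels):
        votes = sum(p[k] for p in chosen)
        decision = 1 if 2 * votes > len(chosen) else 0  # ties go to class 0
        correct += decision == label
    return correct / len(labels)


def select_ensemble(predictions, labels, pop_size=20, generations=30,
                    p_mut=0.1, seed=0):
    """Hypothetical genetic algorithm: evolve a 0/1 mask over candidate networks."""
    rng = random.Random(seed)
    n = len(predictions)

    def fitness(mask):
        return majority_vote_accuracy(mask, predictions, labels)

    # Start from the full ensemble plus random masks.
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability p_mut.
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

Here `predictions[i][k]` would be the class that network `i` predicts for validation image `k`. Because the full ensemble is seeded into the initial population and the best masks always survive, the returned subset scores at least as well as using every network on the data it was selected on.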

Crossref

Repositorio Institucional Universidad de Málaga

Collaborative Layer-wise Discriminative Learning in Deep Neural Networks

Author: C Cortes
C Farabet
C Xu
DR Cox
GE Hinton
J Nickolls
MD Zeiler
N Srivastava
O Russakovsky
P Viola
RO Duda
SS Haykin
Y Bengio
Y Wei
Publication date: 19/07/2016

Intermediate features at different layers of a deep neural network are known to be discriminative for visual patterns of different complexities. However, most existing works ignore such cross-layer heterogeneities when classifying samples of different complexities. For example, if a training sample has already been correctly classified at a specific layer with high confidence, we argue that it is unnecessary to force the remaining layers to classify this sample correctly, and a better strategy is to encourage those layers to focus on other samples. In this paper, we propose a layer-wise discriminative learning method to enhance the discriminative capability of a deep network by allowing its layers to work collaboratively for classification. To this end, we introduce multiple classifiers on top of multiple layers. Each classifier not only tries to correctly classify the features from its input layer, but also coordinates with other classifiers to jointly maximize the final classification performance. Guided by the other companion classifiers, each classifier learns to concentrate on certain training examples and boosts the overall performance. Allowing for end-to-end training, our method can be conveniently embedded into state-of-the-art deep networks. Experiments with multiple popular deep networks, including Network in Network, GoogLeNet and VGGNet, on object classification benchmarks of various scales, including CIFAR100, MNIST and ImageNet, and scene classification benchmarks, including MIT67, SUN397 and Places205, demonstrate the effectiveness of our method. In addition, we analyze the relationship between the proposed method and classical conditional random field models. Comment: To appear in ECCV 2016. May be subject to minor changes before the camera-ready version

arXiv.org e-Print Archive

Crossref

Fast Parallel Suffix Array on the GPU

Author: E Lindholm
J Nickolls
JA Edwards
NJ Larsson
V Osipov
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/08/2015

Crossref

eScholarship - University of California

Optimistic Parallelism on GPUs

Author: E Ayguadé
J Nickolls
JE Stone
L Dagum
S Liu
S Wienke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 05/03/2015

We present speculative parallelization techniques that can exploit parallelism in loops even in the presence of dynamic irregularities that may give rise to cross-iteration dependences. The execution of a speculatively parallelized loop consists of five phases: scheduling, computation, misspeculation check, result committing, and misspeculation recovery. While the first two phases enable exploitation of data parallelism, the latter three phases represent overhead costs of using speculation. We perform the misspeculation check on the GPU to minimize its cost. We perform result committing and misspeculation recovery on the CPU to reduce the result copying and recovery overhead. The scheduling policies are designed to reduce the misspeculation rate. Our programming model provides an API for programmers to give hints about potential misspeculations to reduce their detection cost. Our experiments yielded speedups of 3.62x-13.76x on an NVIDIA Tesla C1060 hosted in an Intel(R) Xeon(R) E5540 machine.
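The five phases named in the abstract above can be sketched on a toy loop. This is a CPU-only illustration under stated assumptions, not the paper's GPU implementation or API: the loop body `out[writes[i]] = out[reads[i]] + 1`, the fixed-size chunk scheduling, the in-order commit, and the names `run_sequential` and `run_speculative` are all hypothetical.

```python
def run_sequential(data, reads, writes):
    """Reference semantics: iteration i executes out[writes[i]] = out[reads[i]] + 1."""
    out = list(data)
    for r, w in zip(reads, writes):
        out[w] = out[r] + 1
    return out


def run_speculative(data, reads, writes, chunk_size=4):
    """Run the same loop chunk by chunk, speculating that iterations are independent."""
    out = list(data)
    n = len(reads)
    redone = 0
    for start in range(0, n, chunk_size):
        # 1. Scheduling: pick the next chunk of iterations.
        chunk = range(start, min(start + chunk_size, n))
        # 2. Computation: every iteration reads the pre-chunk state,
        #    as if all iterations of the chunk ran in parallel.
        snapshot = list(out)
        results = {i: snapshot[reads[i]] + 1 for i in chunk}
        # 3. Misspeculation check: find the first iteration that read a
        #    location written by an earlier iteration of the same chunk.
        written = set()
        first_bad = chunk.stop
        for i in chunk:
            if reads[i] in written:
                first_bad = i
                break
            written.add(writes[i])
        # 4. Result committing: apply the valid speculative results in order.
        for i in range(start, first_bad):
            out[writes[i]] = results[i]
        # 5. Misspeculation recovery: re-execute the rest of the chunk sequentially.
        for i in range(first_bad, chunk.stop):
            out[writes[i]] = out[reads[i]] + 1
            redone += 1
    return out, redone
```

For any chunk size, `run_speculative` should return the same array as `run_sequential`, and the second return value counts the iterations that had to be re-executed. Committing in order up to the first conflict is a deliberately conservative choice that keeps the sketch short.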

CiteSeerX

Crossref

MemShield: GPU-assisted software memory encryption

Author: A Würstlein
J Bauer
J Lin
J Nickolls
JA Halderman
M Henson
M Huber
M Zhang
P Papadopoulos
R Stoyanov
S Dey
S Maitra
S Vömel
Y Chen
Z Wang
Publication date: 20/04/2020

Cryptographic algorithm implementations are vulnerable to Cold Boot attacks, which exploit the persistence of RAM cells across reboots or power-down cycles to read the memory contents and recover sensitive data. The principal defense against Cold Boot attacks is memory encryption. In this work we propose MemShield, a memory encryption framework for user space applications that exploits a GPU to safely store the master key and perform the encryption/decryption operations. We developed a prototype that is completely transparent to existing applications and does not require changes to the OS kernel. We discuss the design, the related work, the implementation, the security analysis, and the performance of MemShield. Comment: 14 pages, 2 figures. In proceedings of the 18th International Conference on Applied Cryptography and Network Security, ACNS 2020, October 19-22 2020, Rome, Italy

arXiv.org e-Print Archive

Crossref

ART

Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

Author: A Hart
D Dutykh
G Ruetsch
I Karlin
IZ Reguly
J Gong
J Nickolls
JE Stone
M Martineau
M Norman
MB Giles
S Wienke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL), each supporting different combinations of languages. In this study, we take a detailed look at some of the currently available options, and carry out a comprehensive analysis and comparison using computational loops and applications from the domain of unstructured mesh computations. Beyond runtimes and performance metrics (GB/s), we explore factors that influence performance such as register counts, occupancy, usage of different memory types, instruction counts, and algorithmic differences. The results show that clang's CUDA compiler frequently outperforms NVIDIA's nvcc, that directive-based approaches have performance issues on complex kernels, and that OpenMP 4 support is maturing in clang and XL, currently running around 10% slower than CUDA.

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Repository of the Academy's Library

GPU-Based Data Processing for 2-D Microwave Imaging on MAST

Author: BITTNER R.
CASTRO R.
DAVIS W. M.
EIDIETIS N. W.
FREETHY S. J.
GARELLI N.
HUANG B. K.
LUJAN P.
MONTEIRO E.
NAVARRO C. A.
NAYLOR G. A.
NICKOLLS J.
OWENS J. D.
PELL O.
SALMON N. A.
SHEVCHENKO V. F.
THOMAS D. A.
THOUTI K.
URBAN J.
VAN CITTERT P. H.
VERMIJ E.
WYNTERS E.
XU C.
YANG L.
YUE X.
ZERNIKE F.
Publication venue: 'American Nuclear Society'
Publication date: 01/05/2016

The Synthetic Aperture Microwave Imaging (SAMI) diagnostic is a Mega Amp Spherical Tokamak (MAST) diagnostic based at Culham Centre for Fusion Energy. The acceleration of the SAMI diagnostic data-processing code by a graphics processing unit is presented, demonstrating a speedup of up to 60 times compared to the original IDL (Interactive Data Language) data-processing code. SAMI will now be capable of intershot processing, allowing pseudo-real-time control so that adjustments and optimizations can be made between shots. Additionally, for the first time, the analysis of many shots will be possible.

Crossref

White Rose Research Online

DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI

Author: B Langmead
B Schmidt
Bertil Schmidt
BH Bloom
Douglas L Maskell
DR Zerbino
E Lindholm
EW Myers
H Shi
J Butler
J Nickolls
J Schröder
JC Dohm
JT Simpson
L Fan
L Salmela
MJ Chaisson
P Havlak
PA Pevzner
R Li
RL Warren
S Batzoglou
WR Jeck
X Huang
Y Liu
Yongchao Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for de novo assembly in terms of assembly quality and scalability for large-scale short read datasets. Results: We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs) using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize the performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de Bruijn graph-based assemblers. Conclusions: DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory

Author: Allen R.
Frank M.
Jeff Hao
Minh C. C.
Mojtaba Mehrara
Nickolls J.
Po-Chun Hsu
Scott Mahlke
Zhong H.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref